perf(vllm): optimize MiniMax M3 inference on MI300X #1782
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.
PR authors are responsible for ensuring that all GitHub Action jobs fully pass after merging. Failures are often just flakes, and re-running the failed jobs will fix them. If failed jobs are re-run, PR authors are responsible for ensuring that they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow
As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
Claude finished @Oseltamivir's task in 7m 0s. Review of PR #1782
LGTM, no blocking issues found. Summary: This is a well-structured incremental optimization for MiniMax M3 EP8 on MI300X. The patch adds two complementary paths (short-context native/BF16 dispatch via the oracle, long-context sparse BF16 emulation with fused SwiGLU), and the gating logic correctly ensures each path is only reached for the profiled configuration. Key observations:
Claude finished @Oseltamivir's task in 4m 57s. Review of PR #1782
LGTM — no blocking issues found. Summary: This is a well-scoped incremental EP8 decode optimization stacked on #1753. The patch adds two complementary paths: short-context EP8 routes to
d1638a0 to 465ff47
95e79da to 27510c4
2b449ab to 87e92e2
Optimized MI300X-only sweep: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27812712075
Matrix: c1, c16, and c256 for each of 1k1k and 8k1k (TP8 at c1/c16, EP8 at c256), using optimized commit
Summary
6f5a3991), which supplies the load-time 128x128 block-FP8 conversion
ISL >= 8192 && CONC >= 16)
This PR contains no profiling configuration and does not modify perf-changelog.yaml.
Profile basis
The final all-rank 8k1k/c256 EP profile shows one kernel stream and no compute/communication overlap window:
Native EP is 2.7% faster per profiled decode step. Across the profiled 256-request batch it improves output throughput by 6.4%, mean TTFT by 6.8%, and mean TPOT by 2.5%.
The remaining native critical path is:
The dependencies are serial at the block boundaries, so moving these kernels to another stream would not hide useful work. The implementation instead removes work from each stage and fuses only where the measured dependency permits it.
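The reasoning about serial dependencies can be illustrated with a toy scheduling model. This is only a sketch under stated assumptions: the stage names and durations below are placeholders, not values from the profile, and `makespan` is a hypothetical helper. It computes the earliest finish time of a dependency graph given unlimited streams. For a pure chain, that time equals the sum of the stage durations, so extra streams change nothing and only shortening or fusing stages helps.

```python
from functools import lru_cache


def makespan(durations, deps):
    """Earliest finish time assuming unlimited parallel streams.

    durations: dict mapping stage name -> duration
    deps: dict mapping stage name -> list of prerequisite stage names
    """

    @lru_cache(maxsize=None)
    def finish(stage):
        # A stage starts once all of its prerequisites have finished.
        start = max((finish(d) for d in deps.get(stage, [])), default=0.0)
        return start + durations[stage]

    return max(finish(stage) for stage in durations)


# Placeholder stages and durations (hypothetical, NOT measured values).
durations = {"route": 2.0, "dispatch": 3.0, "experts": 5.0, "combine": 3.0}

# Serial chain: each stage depends on the previous one.
serial_deps = {
    "dispatch": ["route"],
    "experts": ["dispatch"],
    "combine": ["experts"],
}

serial_time = makespan(durations, serial_deps)
independent_time = makespan(durations, {})

print(serial_time)       # equals the sum of all stage durations
print(independent_time)  # equals the longest single stage
```

Only the second, dependency-free case benefits from multiple streams; the serial case is bounded by the sum of its stages.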
Optimizations
All optimized paths are gated to the profiled MiniMax M3/gfx942 shapes. Other models, platforms, parallel modes, and unsupported shapes retain the existing path.
Performance
MI300X output throughput, aggregate across 8 GPUs:
The action rows use the regular InferenceX sweep request count. The final row is a warmed production-image spot check with 256 fixed-length requests; the same-harness component run improved from 1,391.9 to 1,512.8 tok/s (+8.7%) before the final production-image validation.
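The quoted component-run improvement can be checked with a one-line relative-change calculation. This is a minimal sketch; `pct_change` is a hypothetical helper name, and the two inputs are the tok/s figures stated above.

```python
def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Same-harness component run, output throughput in tok/s (from the text above).
component_gain = pct_change(1391.9, 1512.8)
print(f"{component_gain:+.1f}%")  # prints "+8.7%"
```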
Production-image spot checks with the exact committed patch:
Component A/B results:
Validation
vllm/vllm-openai-rocm:minimax-m3
4a560dd8db67c270f5e2afb614558271b76f2294
git diff --check
bash -n and ShellCheck
python -m pytest utils/matrix_logic/ -v: 156 passed
Note
Medium Risk
Large inference-runtime patch changes MoE routing, collectives, and model forward semantics on a gated path; wrong gating could affect numerics or parallelism, but scope is limited to profiled MiniMax M3 MI300X configurations.
Overview
Adds a second runtime patch (minimaxm3_mi300x_profiled.patch) on top of the existing MXFP8 block-FP8 patch, and refactors the MI300X benchmark script to apply both patches generically, optionally install a pinned AITER build for TP8-only fused all-reduce + Gemma RMSNorm, and pass --max-num-batched-tokens 32768 when ISL >= 8192 and CONC >= 16.
The patch targets profiled MiniMax M3 / gfx942 shapes: EP8 MoE route compaction and tuned block-FP8 expert configs; a gfx942 small-batch router GEMM; Triton tweaks to sparse attention and index scoring; deferred FFN all-reduces fused into the next Gemma norm boundary on TP8; and replicated input embeddings on MI300X TP8 to drop an extra collective. AITER Gemma fusion stays off for EP and non-TP8; native collectives remain there.
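The condition for the batched-token flag can be sketched as follows. The benchmark script itself is a shell script that is not shown here, so this Python version only mirrors the stated condition; `extra_serve_args` is a hypothetical helper, and the mapping of "1k" to 1024 and "8k" to 8192 tokens is an assumption.

```python
def extra_serve_args(isl, conc):
    """Return extra server arguments for a given input length and concurrency.

    Mirrors the stated rule: pass --max-num-batched-tokens 32768
    only when ISL >= 8192 and CONC >= 16.
    """
    args = []
    if isl >= 8192 and conc >= 16:
        args += ["--max-num-batched-tokens", "32768"]
    return args


# Sweep matrix from the PR: c1, c16, c256 for 1k1k and 8k1k
# (assuming 1k = 1024 and 8k = 8192 input tokens).
for isl in (1024, 8192):
    for conc in (1, 16, 256):
        print(isl, conc, extra_serve_args(isl, conc))
```

Under that assumption, only the 8k1k runs at c16 and c256 receive the flag.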
Gating is explicit (parallel mode, token counts, hidden size 6144, etc.) so other models and platforms keep prior behavior, with fallbacks to unfused all-reduce + norm where fast paths do not apply.
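A gate of this kind might look like the sketch below. It is an illustration under assumptions, not the patch's code: `LayerContext`, `select_allreduce_norm_path`, the returned path labels, and the token threshold `MAX_FAST_PATH_TOKENS` are all hypothetical. Only the gfx942 architecture, hidden size 6144, the TP8-only restriction, and the unfused fallback come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerContext:
    gpu_arch: str       # e.g. "gfx942" for MI300X
    hidden_size: int    # model hidden dimension
    parallel_mode: str  # "TP8" or "EP8" in this sketch
    num_tokens: int     # tokens in the current batch


# Hypothetical threshold; the real token-count limits are not given here.
MAX_FAST_PATH_TOKENS = 256


def select_allreduce_norm_path(ctx):
    """Pick the fused fast path only for the profiled configuration."""
    if (
        ctx.gpu_arch == "gfx942"
        and ctx.hidden_size == 6144
        and ctx.parallel_mode == "TP8"
        and 0 < ctx.num_tokens <= MAX_FAST_PATH_TOKENS
    ):
        return "fused_allreduce_gemma_rmsnorm"
    # Everything else keeps the prior behavior.
    return "unfused_allreduce_then_norm"


profiled = LayerContext("gfx942", 6144, "TP8", 16)
expert_parallel = LayerContext("gfx942", 6144, "EP8", 16)
print(select_allreduce_norm_path(profiled))
print(select_allreduce_norm_path(expert_parallel))
```

Checking every condition explicitly and defaulting to the unfused path means an unprofiled shape can never reach the fast path by accident.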
Reviewed by Cursor Bugbot for commit 87e92e2. Bugbot is set up for automated code reviews on this repo.